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ABSTRACT 


Many  problems  in  applied  mathematics  require  the  evaluation  of  the  sum  of  N  Gaussians 
at  M  points  in  space.  The  work  required  for  direct  evaluation  grows  like  NM  as  N  and  M 
increase;  this  makes  it  very  expensive  to  carry  out  such  calculations  on  a  large  scale.  In  this 
paper,  we  present  an  algorithm  which  evaluates  the  sum  of  N  Gaussians  at  M  arbitrarily 
distributed  points  in  C-(N-rM)  work,  where  C  depends  only  on  the  precision  required.  When 
N  —  M  =  100,000,  our  algorithm  is  several  thousand  times  faster  than  direct  evaluation. 
It  is  based  on  a  divide  and  conquer  strategy,  combined  with  the  manipulation  of  Hermite 
expansions  and  Taylor  series. 
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Many  problems  in  applied  mathematics  involve  the  Gauss  transform 

Gsf(x)  =  Jre-'*-yf'sf(y)dy  (6  >  0)  (1) 

of  a  function  /  defined  on  T  C  R^.  The  simplest  example  is  the  heat  equation.  The  solution 
of  the  pure  initial  value  problem 


ut(x,t)  =  A u(x,t)  for  t  >  0 
u(x,0)  =  f(x)  for  i€Rj 


is  given  by 

u(x,  t)  =  (47 rt)~d/2  G«/(x) , 

with  T  equal  to  the  whole  space.  A  similar  transform,  with  T  a  lower-dimensional  subset 
of  Rd,  occurs  when  one  solves  any  initial/boundary  value  problem  for  the  heat  equation  by 
means  of  potential  theory  [1,  6,  9,  13,  14].  Other  examples  occur  in  vortex  methods  [3], 
tomography  [11],  and  nonparametric  statistics  [7,  15].  Finally,  a  common  analytical  tool 
is  mollification ;  one  approximates  an  arbitrary  function  /  by  the  family  of  smooth  rapidly 
decreasing  functions 

fs(x)  =  (n6)-d'2Gsf(x) 

which  converge  to  /  as  8  — >  0. 

For  numerical  purposes,  one  must  discretize  G$f.  Given  the  values  of  /  at  a  set  of 
points  sj  €  Rd,  one  can  approximate  the  integral  (1)  by  means  of  a  quadrature  formula.  A 
reasonable  approximation  to  G$f  might  then  take  the  form  of  a  discrete  Gauss  transform 


G(x)  =  f]9je- 


j=l 


(2) 


where  the  coefficients  q}  depend  on  the  values  f{sj)  and  the  weights  of  the  chosen  quadrature 
formula.  This  paper  will  focus  on  the  problem  of  evaluating  f  .um  of  Gaussians  (2)  as 
efficiently  as  possible.  It  will  often  be  convenient  to  speak  of  (2)  \  the  Gaussian  “field”  due 
to  sources  of  strengths  qj  at  the  points  s;,  evaluated  at  the  “target  x. 

Suppose  now  that  we  want  to  evaluate  (2)  directly  from  the  definition  at  M  targets 
x  =  t{.  In  other  words,  we  want  to  apply  the  rectangular  matrix  with  entries 

Gij  = 

to  the  vector  ?  =  (?i,  •  •  • ,  This  requires  O(NM)  work,  which  grows  rapidly  as  M  and 

N  increase,  and  makes  large  scale  calculations  prohibitively  expensive. 

In  this  paper,  we  present  an  algorithm  for  evaluating  (2)  at  M  points  in  0(N  +  M)  work. 
The  constant  in  0(N  +  M)  depends  only  on  the  dimension  d  and  the  desired  precision. 
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The  amount  of  memory  required  is  also  proportional  to  N  +  M,  so  that  the  algorithm  is 
asymptotically  optimal  in  terms  of  both  work  and  storage.  Furthermore,  the  sources  and  the 
targets  can  be  placed  anywhere;  they  need  not  be  restricted  to  a  regular  grid.  Even  if  the 
function  /  were  given  at  N  equispaced  points  and  Gsf  evaluated  at  N  equispaced  points, 
fast  convolution  by  means  of  the  Fast  Fourier  transform  (FFT)  would  require  0{N  log  N) 
operations,  whereas  our  algorithm  requires  only  O(N). 


2  Hermite  expansions 

This  section  describes  the  properties  of  the  Gaussian  kernel  and  Hermite  expansions  which 
we  will  need.  Good  references  for  this  material  are  [4,  5,  8]  and  particularly  Hide's  paper 
[10]. 

The  Hermite  polynomials  Hn(t)  may  be  defined  by  the  Rodrigues  formula 


Hn(t)  =  (-l)V2i7V 


t  e  R 


where  D  =  d/dt.  We  will  make  use  of  this  definition  as  well  as  the  generating  function  for 
Hermite  polynomials 


oo  cn 

r— \ 


-  E  3«.<o 


Multiplication  of  each  side  of  the  preceding  expression  by  e  ‘2  yields 


~  sn 


where  the  Hermite  functions  hn(t)  are  defined  by 


hn(t)  =  e~l  Hn(t).  (3) 

(Note  that  these  are  not  the  usual  orthonormal  Hermite  functions;  the  definition  here  is  the 
right  one  for  this  situation.)  In  practice,  we  will  use  a  shifted  and  scaled  version  of  this 
formula:  for  so  €  R  and  6  >  0,  we  have 


e-(t-af/s  _  e-(t-ao-(3-Jo))2/« 

=  EiterV 


t  -  s0 


_  P-h-*o)7<5  v'  JL  i  £ _ £° 

k  V 


t  -  s0\ 

V~6  )' 


This  formula  tells  us  how  to  evaluate  the  Gaussian  field  at  the  target  t  due  to 

the  source  at  s,  as  an  Hermite  expansion  centered  at  s0.  Thus  we  are  shifting  a  Gaussian  at 
s  to  a  sum  of  Hermite  polynomials  times  a  Gaussian,  all  centered  at  sq. 


We  can  also  interchange  t  and  s  to  write 


Looked  at  this  way,  the  expansion  expresses  a  Gaussian  with  target  t  as  a  Taylor  series  about 
a  nearby  target  t0 ;  the  coefficients  of  the  Taylor  series  are  the  Hermite  functions  evaluated 
at  t0-  Thus  the  same  expansion  serves  as  both  a  near-field  (Taylor)  and  a  far-field  (Hermite) 
expansion.  The  final  one-dimensional  results  we  will  need  are  the  recurrence  relation 

hn+i(t)  =  2t  hn(t)  -  2n  hn-i(t)  t  6  R, 

for  Hermite  functions  and  Cramer’s  inequality  for  Hermite  polynomials: 

\Hn(t)\  <  I<2n/2V^.et2/2 

where  K  is  a  numerical  constant  less  than  1.09  in  value.  Cramer’s  inequality  immediately 
implies  the  following  useful  bound  for  Hermite  functions: 

n!  Vn! 

We  will  also  need  the  straightforward  extensions  of  these  facts  to  higher  dimensions. 
Thus,  let  t  and  s  lie  in  d-dimensional  Euclidean  space  Rd,  and  consider  the  Gaussian 

e-|«-»ls  _  e-(«i-«i)2--.-(td-Sd)2_ 

We  will  find  it  convenient  to  adopt  multiindex  notation.  A  multiindex  a  =  (au,a:2,  -  • .  ,Od) 
is  a  d-tuple  of  nonnegative  integers,  playing  the  role  of  a  multidimensional  index.  For  any 
multiindex  a  and  any  t  G  Rd,  we  define 

|aj  =  ai  +  a2  +  •  •  •  + 


a!  =  Qi!a2l  •  •  • 


ta  =  t“*t“2  ...tadd 


Da  =  d?ld%2 

where  dt  is  differentiation  with  respect  to  the  eth  coordinate  in  Rd.  If  p  is  an  integer,  we  say 
a  >  p  if  cti  >  p  for  1  <  *  <  d. 

The  multidimensional  Hermite  polynomials  and  Hermite  functions  are  then  defined  by 


Ha(t)  =  Hai(tx)...Had(td) 

ha(t)  =  e~WHa(t)  =  hai(t1)...hQd(td)  (5) 

where  \t\2  —  t\  -(- . . .  +  t%. 
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(6) 


The  Hermite  expansion  of  a  Gaussian  in  Rd  is  then  simply 

e-'*-'1  =  £  h.(.  -  so) . 


Cramer’s  inequality  generalizes  to 


-i|MOI  <  Ke~W/2  2|o|/2-i=, 


where  K  is  less  than  (1.09)<i. 

Finally,  our  algorithm  will  require  the  Taylor  expansion  of  the  Hermite  function  ha(t) 
about  an  arbitrary  point  f0  £  Rrf.  Since  ha  is  defined  by 


ha(t)  =  e-t2Ha(t) 

=  (-l)|Q|D^e-t2  , 


applying  D0  gives  immediately 


Thus  the  Taylor  series  of  hQ  is 


Deha(t)  =  (- 


M<)=£^^^(-i)wW‘o).  (10) 

0>O  P' 

We  now  present  the  three  lemmas  on  which  our  algorithm  relies.  The  first  describes  how 
to  transform  the  field  due  to  all  sources  in  a  box  into  a  single  rapidly  converging  Hermite 
expansion  about  the  center  of  the  box. 

Lemma  2.1  Let  NB  sources  Sj  He  in  a  box  B  with  center  sb  and  side  length  r\/28,  with 
r  <  1.  Then  the  Gaussian  field  due  to  the  sources  in  B , 


;= i 


is  equal  to  a  single  Hermite  expansion  about  sb  • 


G(t)=  £  Aaha  ( . 


The  coefficients  Aa  are  given  by 


1  ^  s:-sB 

Aa  ~  X,  ‘M  ./I 
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(12) 


The  error  En{p )  due  to  truncating  the  series  after  pd  terms  satisfies  the  bound: 


\EH(p)\  =  \J2Aaha  (L-^.j\<KQB 


1  V'2  (rp+'  V 
p!  ]  Vl-rj 


where  QB  =  £  !<?_,- 1  and  I<  =  (1.09)d. 

Proof:  Use  (6)  to  expand  each  Gaussian  in  the  sum  (11)  into  a  Hermite  series  about  sb 

and  interchange  the  sums  over  a  and  j.  The  truncation  error  bound  follows  from  Cramer’s 
inequality  (7)  and  the  formula  for  the  tail  of  a  geometric  series.  □ 

The  second  lemma  shows  how  to  convert  an  Hermite  expansion  about  sb  into  a  Taylor 
expansion  about  tc .  The  Taylor  series  converges  rapidly  in  a  box  of  side  r\/26  about  tc , 
with  r  <  1. 


Lemma  2.2  The  Hermite  expansion 


G(t)  =  £  Aaha( 

a>0  \ 


t  —  SB 

~7T 


has  the  following  Taylor  expansion,  about  an  arbitrary  point  tc: 


«"  ■ ”  Px)' 


The  coefficients  Bp  are  given  by 


Bn  = 


(-0W 


’■  £  A°h°+‘!(SBrfC) 


If  the  Aa  are  given  by  (12)  then  the  error  Er{p)  in  truncating  the  Taylor  series  after  pd 
terms  is  bounded,  in  the  box  C  with  center  tc  and  side  length  r\/28,  by 


*  '-KQb{* 


i \d/2  /V+1  \d 


1  —  r 


if  r  <  1 . 

Proof:  Each  Hermite  function  in  (13)  can  be  expanded  into  a  Taylor  series  by  means  of 

equation  (10).  The  expansion  (14)  is  then  obtained  by  interchanging  the  order  of  summation. 
The  truncation  error  bound  is  only  a  little  more  difficult:  By  the  formula  (12)  for  Aa ,  we 
have 

d  _  (-1)l/31  V'  a  l  fsB~tc\ 

B0  -  at  Acha+0  r- 


(_1)W 


(-l)w 


1  ( 8j 


SB  ~  tc 


SB  ~  tc 
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But  by  (10),  the  inner  sum  is  the  Taylor  expansion  of  hg((sj  —  tc)/\/S)-  Thus 


B0  = 


(-I)1*31 

/?'• 


Nb 

52  <1: 

i= i 


and  Cramer’s  inequality  implies 

i  o|/3|/2 

K  Qb2^/2  yf$\  <  I<  Qb  -jp  • 


(16) 


(17) 


The  truncation  error  bound  follows,  as  in  Lemma  2.1,  from  summing  the  tail  of  a  geometric 
series.  □ 

For  our  algorithm,  we  will  need  a  variant  of  Lemma  2.2  in  which  the  Hermite  series  is 
truncated  before  converting  it  to  a  Taylor  series.  This  essentially  means  that  in  addition  to 
truncating  the  Taylor  series  itself,  we  are  also  truncating  the  infinite  sum  expression  (15)  for 
the  coefficients.  Fortunately,  however,  the  error  due  this  approximation  of  the  coefficients 
turns  out  to  be  much  smaller  than  the  truncation  error  of  the  Taylor  series. 


Lemma  2.3  A  truncated  Hermite  expansion 


gw-^M‘2b) 

has  the  following  Taylor  expansion  about  an  arbitrary  point  tc-' 


G(t )  =  Z  ce 

0>O 


t-tc 

,  y/s  , 


p 


The  coefficients  Cg  are  given  by 


Cp 


(-1)1'31 


^  -da  ha+ff 

a<p 


( SB  ~  tc\ 

[  V6  I’ 


(18) 


If  the  Aa  are  given  by  (12)  then  the  error  Er{p)  in  truncating  the.  Taylor  series  after  pd 
terms  is  bounded,  in  the  box  C  with  center  tc  and  side  length  r\/28,  by 

mP)i  =  ^pCe(^)\<K'Qe 

where  K'  <  K  (1  +  ( p')~d ^2)  £  2 K  if  r  <  1/2. 
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Proof:  The  proof  is  an  application  of  the  triangle  inequality.  Write  Cp  as  the  coefficient 

Bp  from  Lemma  2.2  plus  the  tail  of  a  series 


C«  = 


(-DW^  1  (sj-ssV  ( 

t*  £'£<4  Vs  ) 

Ie-eUi^ 


SB  “  tc 


j= 1  \a>0  cr>p 


SB  ~  tc 


=  Bo- 


(-I)W 


p-  j  =  l  o>p  a- 


1  (sj-  SB 


a!  V  VS 


Sb  —  tc 


—  Bp  +  ( Cp  —  Bp) . 


Then 


(tt)‘ 


0>P  \  vu  /  p>p 

From  Lemma  2.2,  we  know  that  the  first  sum  is  bounded  by 

Hence  we  need  only  bound  the  second  sum.  For  this,  we  have 


k-«  (W 


Qb 

0>p 


t  -  tc 

Vs 


1  v-'  I  /  si  ~  SB 
1  .  /~k 


| \ha+0 


SB  ~  tc 


„  V'  V-  /(5Taj!  rlOl 


(Q  +  ^)!  <  2l®+0l 

or!/?!  - 

so  the  lemma  follows  immediately.  □ 

Remark  2.1.  The  alternate  expression  (16)  for  which  appears  in  the  proof  of  Lemma 
2.2  has  a  simple  meaning.  Rather  than  using  Lemma  2.1  to  accumulate  all  the  Gaussians 
into  a  single  Hermite  expansion  and  then  shifting  it  to  tc ,  we  can  use  Lemma  2.3  to  shift 
each  Gaussian  individually  to  tc  and  add  up  the  resulting  Taylor  coefficients.  (A  Gaussian  is 
a  one-term  Hermite  series,  after  all,  and  can  therefore  be  shifted  just  like  any  other  truncated 
Hermite  series.)  Thus,  a  Gaussian 

G(t)  =  qe-]i~Vys 


( 


has  the  following  Taylor  expansion  about  tc; 

The  coefficients  Bp  are  given  by 

d  _  .M)1*1,.  (SJ  ~tc\ 
Ba~q—\ ~hl>  [—  ) 

and  the  error  in  truncation  after  pd  terms  is 


\Er(p)\  =  I  E  Bp 

0>P 


t  —  tc 

~7T 


l\d/2  (  rP+1 


—  K  Q  I  ~T 


?\  V1_r, 


for  r  <  1. 


(20) 


3  The  fast  Gauss  transform 

We  now  have  the  tools  necessary  to  construct  and  analyze  a  fast  algorithm  for  evaluating 
the  discrete  Gauss  transform 

G(U)  =  Y.(h^t'~a,?IS  (21) 

j=i 

for  1  <  r  <  M  in  0(M  4-  iV)  work.  By  shifting  the  origin  and  rescaling  6  if  necessary,  we 
can  assume  (as  a  convenient  normalization)  that  the  sources  s:  and  targets  tx  all  lie  in  the 
unit  box  B0  =  [0,  l]d. 

The  algorithm  is  based  on  subdividing  B0  into  smaller  boxes  with  sides  of  length  r\/26 
parallel  to  the  axes,  with  a  fixed  r  <  1/2.  We  can  then  assign  each  source  Sj  to  the  box 
B  in  which  it  lies  and  each  target  t,  to  the  box  C  where  it  lies.  For  the  sake  of  clarity,  we 
maintain  a  notational  distinction  between  source  boxes  B  and  target  boxes  C  even  though 
they  may  be  the  same. 

For  each  target  box  C,  we  need  to  evaluate  the  total  field  due  to  sources  in  all  boxes  J3, 
at  each  target  in  C.  Because  the  range  of  the  Gaussian  is  (^(v^)  a  id  the  boxes 

have  side  lengths  r\/26,  only  a  fixed  number  of  source  boxes  B  can  contribute  more  than 
Qe  to  the  field  in  a  given  target  box  C,  where  Q  =  J^jLi  |<7jl  and  e  is  a  specified  precision. 
Indeed,  if  we  cut  off  the  sum  over  all  B  after  including  the  (2n  +  l)d  nearest  boxes  to  C,  we 
incur  an  error  bounded  by  Qe"2r2"J.  We  can  always  choose  n  depending  only  on  r  and  e  to 
make  this  less  than  Qe.  For  example,  if  r  =  1/2  we  get  single  precision  accuracy  relative  to 
Q  with  n  =  6  and  double  precision  with  n  =  8. 

Suppose  now  that  we  want  to  evaluate  the  field  due  to  a  box  B  with  Nb  sources  at  Me 
targets  in  a  box  C.  There  are  two  ways  in  which  B  can  transmit  its  influence,  and  two  ways 
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J 
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in  which  C  can  handle  the  information  it  receives.  B  can  directly  send  out  the  strengths 
and  centers  of  all  Nb  Gaussians  located  in  B,  or  it  can  use  Lemma  2.1  to  collect  them  into 
a  single  truncated  Hermite  expansion.  C  can  then  directly  evaluate  all  fields  (Gaussians  or 
Hermite  expansions)  sent  to  it,  at  the  Ale  target  locations  in  C,  or  it  can  use  Lemma  2.3  to 
convert  the  fields  sent  to  it  into  a  single  truncated  Taylor  expansion  about  the  center  tc  of 
C.  Evaluation  of  this  Taylor  series  then  yields  the  total  field  at  each  target  location. 

Thus,  there  are  four  possible  ways  in  which  B  can  influence  C . 


1.  Nb  Gaussians  — >  directly  evaluated 

2.  Nb  Gaussians  — >  accumulated  in  Taj  lor  series  via  (20) 

3.  Hermite  series  — >  direc*'y  evaluated 

4.  Hermite  series  — >  accumulated  in  Taylor  series  via  (18) 

A  fast  algorithm  can  be  based  on  any  one  of  the  second  through  fourth  alternatives, 
because  they  all  decouple  the  number  of  sources  from  the  number  of  targets.  Methods  1 
through  4  require  the  following  work  to  evaluate  G(t)  at  AI  targets  f,. 

1.  The  cost  of  evaluating  N  Gaussians  at  M  points  is  of  the  order  0(N Af). 


2.  Consider  a  fixed  source  box  B.  Foi  each  target  box  C  within  range,  we  must  compute 
pd  Taylor  series  coefficients 


C0(B)  = 


(~l)|g| 

/?! 


E  ht 


a,eB 


Sj  te 

~7T, 


(22) 


Each  coefficient  requires  0(Nb )  work  to  evaluate,  resulting  in  a  net  cost  of  the  order 
0(pdNB)-  Since  there  are  at  most  (2n  +  1)^  boxes  within  range,  the  total  work  for 
forming  all  Taylor  series  is  of  the  order  0((2n  +  \)dpdN).  Now,  for  each  target  f,-,  one 
must  evaluate  the  pd  term  Taylor  series  corresponding  to  the  box  C  in  which  f,  lies. 
The  total  cost  of  this  algorithm  is,  therefore, 


0({2n  -f  1  )dPdN)  +  0{pdAI)  . 


3.  In  the  third  approach,  we  form  a  Hermite  series  for  each  box  B  and  evaluate  it  at  all 
targets.  First,  using  Lemma  2.1,  we  write 

G(t)  =  E  E 

B  3j£B 

=  +  E»(p) 
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t 


where  \En(p)\  <  Qt  and 

<23) 

To  compute  each  Aa(B)  costs  O(Nb)  work,  so  forming  all  the  Hermite  expansions 
requires  0{pdN )  work.  Evaluating  at  most  (2 n  +  l)d  expansions  at  each  target  tt  costs 
0((2n  +  l)dpd)  work  per  target,  so  this  approach  results  in  a  total  work 

0(pdN)  +  0((2n  +  \)dpdM)  . 

4.  Finally,  suppose  we  accumulate  all  sources  into  truncated  Hermite  expansions  and 
transform  all  Hermite  expansions  into  Taylor  expansions  via  Lemma  2.3.  Thus,  we 
approximate  G(t)  in  C  by 

G{t)  =  EE^NP/S 

B  3j  $B 

=  E  ^  (~v^C)  +  Et{p)  +  Eff(p) 

where  \EH(p)\  +  \&t(p)\  <  Qg 

c0  =  £  E  Aa(B)  ha+p  (£^£)  ,  (24) 

and  the  coefficients  Aa(B)  are  given  by  (23).  As  we  saw  under  the  third  approach,  it 
costs  0(pdN)  work  to  form  all  the  Hermite  expansions,  i.e.  to  compute  the  coefficients 
Aa(B)  for  a  <  p  and  all  source  boxes  B.  Because  of  the  product  form  (5)  of  ha+p, 
the  computation  of  the  pd  coefficients  Cp  involves  only  0(dpd+1)  operations  for  each 
box  B  in  range.  Therefore,  a  total  of  0((2n  -f  1  )ddpd+l)  work  per  target  box  C  is 
required.  Finally,  evaluating  the  appropriate  Taylor  series  for  each  target  t,  requires 
0{pdM)  work.  Hence  this  algorithm  has  net  CPU  requirements  of  the  order 

0((2n  T  1  )ddpd+xNhox)  +  0(pdN)  +  0(pdM)  , 

where  the  number  of  boxes  Nbox  is  bounded  by  min{{ry/28)~dl2 ,  M).  Note  that  the 
factor  (2n  +  \  )d  no  longer  multiplies  either  the  O(N)  or  O(M)  terms.  The  work  is  now 
decoupled  into  three  parts;  0(pdN)  to  form  Hermite  expansions,  0(pdM)  to  evaluate 
Taylor  series,  and  a  constant  term  depending  on  the  number  of  box-box  interactions 
and  the  cost  of  transforming  a  Hermite  expansion  into  a  Taylor  series. 

Thus,  we  really  have  four  algorithms  for  evaluating  G(t),  three  of  which  are  asymptotically 
optimal.  We  can  try  to  min:’  ".c  th';  constants  in  the  work  estimate  by  varying  the  choice 
of  algorithm  from  box  to  1  ox  Clearly  an  optimal  strategy  for  this  choice  is  global,  but  a 
reasonable  strategy  can  be  .  ..^ructed  in  which  each  box  decides  independently  what  action 
to  take.  For  this  purpose,  let  AV  1  >urces  in  a  box  B  be  within  range  of  Me  targets  in  a  box 
C.  Choose  cutoff  parame  ?rs  /V  and  Mi.  Then 
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1.  if  Nb  <  Nf  then  B  sends  out  Nb  Gaussians. 

2.  if  Nb  >  Np  then  B  sends  out  a  Hermite  expansion. 

3.  if  Me  <  Ml  then  C  evaluates  all  holds  sent  to  it  immediately. 

4.  if  Me  >  Ml  then  C  transforms  all  fields  sent  to  it  into  Taylor  series,  accumulates  the 
coefficients,  and  only  then  evaluates  the  Taylor  series. 

The  work  in  this  algorithm  can  be  broken  down  as  follows: 

1. 


E  o(pjnb) 

Nb>Nf 

to  evaluate  Hermite  expansions, 

2. 


to  evaluate  Gaussians, 
3. 


+  E  E  o(nbMc) 

mc<ml  Nb<Nf 


+  E  E  0(pdMc) 

Mc<Ml  Nb>N/? 

to  evaluate  Hermite  expansions, 

4. 


+  E  E  <V*s) 

Mc>Mi  Nb<Nf 

to  transform  Gaussians  into  Taylor  series, 

5. 


+  E  E  <W+') 

Mc>Ml 

to  transform  Hermite  series  into  Taylor  series, 

6. 


+  E  0(pdMc) 
mc>ml 

to  evaluate  Taylor  series. 

Clearly  we  can  achieve  a  rough  balance  of  work  by  taking  Np  =  0(pd~l)  and  Ml  =  0(pd~l ). 
The  total  work  then  has  the  form  0(pdN)+0(dpd+l(2n  +  l)dmin((r\/26)~d/2,  M))  +  0(pd M). 
This  is  linear  in  N  and  M,  with  a  constant  depending  only  on  the  precision.  The  complexity 
estimate  is  similar  to  the  fourth  algorithm  above,  but  the  advantage  here  is  that  when  there 
are  only  a  few  particles  in  a  box,  the  overhead  associated  with  transformation  of  Hermite 
series  to  Taylor  series  is  avoided. 
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4  Formal  Description  of  the  Algorithm 

In  this  section,  we  describe  the  fast  Gauss  transform  in  a  more  procedural  form. 

Algorithm 

Comment  [Choose  the  largest  r  <  1/2  such  that  l/ri/26  is  an  integer  JV4,^e.  Subdivide  the  unit 
box  into  Ndide  boxes.  Choose  the  number  n  of  boxes  to  go  out  in  each  direction  based  on  r  and  ihe 
desired  precision  e.  Each  source  sends  to  (2n  +  l)d  boxes.  Choose  the  number  of  terms  pd  based 
on  r  and  e.  Choose  the  cutoffs  Np  and  Ml-] 


Step  1. 

Assign  sources  and  targets  to  boxes.  Determine  number  of  boxes  containing  more 
than  Ml  targets.  For  each  such  box,  allocate  storage  for  a  Taylor  series 
with  pd  terms  and  initialize  to  zero. 


Step  2. 

Comment  [Loop  through  boxes,  computing  interactions  between  sources  in  box  and  targets  within 
range  n.  For  each  pair  of  source  and  target  boxes,  one  of  the  four  options  summarized  on  p.  10  is 
used.] 

do  i  =  l,-,Ndde 

Nb  =  number  of  sources  in  iih  box  B. 

Form  the  interaction  list  of  (2n  +  l)d  target  boxes  C  within  range  of  B. 
if  Nb  <  Np  then 

do  j  =  1, ...,  (2n  +  l)d 

Me  =  number  of  targets  in  jlh  box  C  in  interaction  list. 
if  Me  <  Ml  then 

Compute  source/target  interactions  by  direct  evaluation  of  Gaussians. 
else 

Convert  each  of  the  Nb  sources  into  a  Taylor  series  about  the  center  of 
box  C  via  equation  (20)  and  add  to  Taylor  series  for  box  C. 

end  if 
end  do 

else  (Nb  >  Np) 

Form  Hermite  expansion  about  center  of  box  B  due  to  Nb  sources  via  equation  (20). 
do  j  =  1, ...,  (2n  +  l)d 

Me  =  number  of  targets  in  jth  box  C. 
if  Me  <  Ml  then 

Evaluate  Hermite  expansion  at  each  target  location  and  add  to  accumulated 
potential. 

else 

Convert  Hermite  expansion  into  a  Taylor  series  about  the  center  of  box  C 
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by  means  of  equation  (12)  and  add  to  Taylor  series  for  box  C. 

end  if 
end  do 
end  if 
end  do 


Step  3. 

Comment  [Loop  through  boxes  evaluating  Taylor  series  for  boxes  with  more  than  Ml  targets.] 
do  i  —  1, Ndilie 

Me  =  number  of  targets  in  iik  box  C. 
if  Me  >  Ml  then 

Evaluate  Taylor  series  for  box  C  at  each  of  the  Me  target  positions 
to  obtain  the  desired  potential, 
end  if 
end  do 


5  Numerical  Experiments 

In  this  section,  we  present  the  results  of  numerical  experiments  with  the  fast  Gauss  transform 
and  demonstrate  dramatic  speedups  over  the  direct  calculation  for  realistic  problems.  A 
two-dimensional  version  of  the  algorithm  was  programmed  in  Fortran  and  run  on  a  Sun-4 
workstation,  using  up  to  100,000  sources  and  targets  and  8  lying  in  the  range  .0001  to  1.0. 

We  examined  the  cost  of  the  fast  algorithm  as  compared  to  the  direct  evaluation  of  all 
the  Gaussians,  as  N  and  M  increased.  Two  distributions  of  targets  and  sources  were  tried, 
uniformly  distributed  in  the  unit  box  and  equally  spaced  on  a  circle,  and  several  values  of  8 
were  used.  For  the  uniform  distribution,  strengths  were  uniformly  distributed  between  —  1 
and  1.  For  the  circle,  the  strengths  were  specified  to  be  cos(0),  where  9  is  the  angle.  We 
asked  for  a  error  relative  to  the  total  charge  Q  of  e  =  10— 6 ,  which  required  pd  =  82  terms  in 
the  Hermite  and  Taylor  expansions.  The  results  are  given  in  Tables  1-4. 

With  102,400  sources  and  targets  equispaced  on  a  circle  and  8  =  .01,  the  fast  algorithm 
is  more  than  3000  times  faster  than  the  direct  calculation.  Typically  the  performance  of 
the  fast  algorithm  improves  when  the  source  distribution  is  spatially  non-uniform,  as  it  is 
in  many  practical  problems.  There  are  then  more  particles  in  each  of  a  smaller  number  of 
occupied  boxes,  reducing  overhead  costs. 

Storage  requirements  for  the  fast  algorithm  are  reasonable  as  long  as  N  and  M  are  large 
and  8  is  not  too  small.  For  extremely  small  8 ,  one  should  modify  the  algorithm  to  make  use 
of  just  those  boxes  containing  targets  or  sources.  The  interaction  list  for  each  source  box 
can  then  be  formed  by  means  of  an  adaptive  tree  structure,  as  is  done  in  the  fast  multipole 
method  [2].  This  would  avoid  having  to  loop  through  Ndide  =  (2 r28)~d^2  (largely  empty) 
boxes. 
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Table  1 

Table  of  cost  and  errors  for  8  =  1.0,  with  targets  and  sources  distributed  uniformly  in  the  unit  box.  CPU 
times  are  given  in  seconds  for  the  fast  and  direct  algorithms.  Times  for  the  direct  algorithm  were 
estimated  by  evaluating  G(£)  at  100  targets  and  extrapolating.  All  information  was  rounded  to  two 
significant  digits. 


Case 

N  =  M 

Fast 

Direct 

Error/ Q 

1 

100 

0.42 

0.59 

.627E-08 

2 

200 

0.62 

2.3 

.306E-08 

3 

400 

1.1 

9.7 

.175E-08 

4 

800 

1.8 

38 

.157E-08 

5 

1600 

3.4 

150 

.126E-08 

6 

3200 

6.5 

601 

.135E-08 

7 

6400 

12.8 

2407 

.114E-08 

8 

12800 

26.0 

9702 

.563E-09 

9 

25600 

51.9 

38790 

.563E-09 

10 

51200 

103 

155550 

.337E-09 

11 

102400 

205 

622780 

.237E-09 

Table  2 

Table  of  cost  and  errors  for  8  —  0.01,  with  targets  and  sources  distributed  uniformly  in  the  unit  box. 


Case 

II 

£; 

Fast 

Direct 

Ettot/Q 

1 

100 

0.65  1 

0.70 

.115E-08 

2 

200 

1.840 

2.76 

.616E-09 

3 

400 

5.8 

10.9 

.478E-09 

4 

800 

20.6 

43.6 

.291E-08 

5 

1600 

115 

174 

.274E-08 

6 

3200 

349 

697 

.443E-08 

7 

6400 

344 

2792 

.249E-08 

8 

12800 

353 

11173 

.177E-08 

9 

25600 

383 

44650 

.144E-08 

10 

51200 

431 

179120 

.120E-08 

11 

102400 

538 

716760 

.501E-09 
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Table  3 

Table  of  cost  and  errors  for  8  =  0.01,  with  targets  and  sources  spaced  uniformly  on  a  circle. 


Case 

II 

Fast 

Direct 

Error/ Q 

1 

100 

3.25 

0.67 

.479E-09 

2 

200 

4.93 

2.70 

.489E-09 

3 

400 

10.9 

10.8 

.182E-06 

4 

800 

12.7 

42.8 

.191E-06 

5 

1600 

14.5 

172 

.203E-06 

6 

3200 

18.5 

684 

.204E-06 

7 

6400 

26.4 

2749 

.201E-06 

8 

12800 

38.7 

11003 

.177E-06 

9 

25600 

65.1 

43955 

.172E-06 

10 

51200 

116 

176300 

.170E-06 

11 

102400 

219 

705650 

.170E-06 

Table  4 

Table  of  cost  and  errors  for  8  =  0.0001,  with  targets  and  sources  spaced  uniformly  on  a  circle. 
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6  A  Generalization 

The  fast  Gauss  transform  generalizes  immediately  to  sums  of  the  form 

(-1)MD"C(0  =  (25) 

)=1 

convolutions  with  a  fixed  Hermite  function.  One  need  only  apply  D a  to  all  the  formulas  pre¬ 
sented  above  and  use  the  formula  (9)  relating  derivatives  of  Hermite  functions.  An  arbitrary 
multivariable  polynomial  P(s)  can  be  expressed  as  a  sum  of  Hermite  polynomials,  so  we  can 
use  our  algorithm  to  evaluate  sums  of  the  form 

in  optimal  time.  As  an  extension  of  this  remark,  we  can  evaluate  any  convolution  sum 

K*f(t)  =  lLf(*i)K(t-Sj)  (26) 

j=i 

for  which  the  kernel  K  has  a  rapidly  converging  Hermite  series.  We  approximate  K  to  within 
c  by  a  gd-term  truncated  Hermite  expansion  and  apply  the  fast  algorithm  of  this  paper  to  f 

carry  out  each  convolution  with  an  Hermite  function.  This  would  cost  0((pq)d(N  +  M))  to 
evaluate  (26)  at  M  points.  A  better  approach,  however,  would  be  to  modify  the  algorithm 
so  as  to  create,  for  each  box,  a  single  ( p  +  g)d-term  far-field  expansion  which  includes  the 
effect  of  Hermite  functions  of  indices  up  to  q.  The  modified  algorithm  would  evaluate  (26) 
at  M  points  in  0((p  +  q)d(N  +  M ))  work;  the  constant  p  +  q  depends  only  on  the  precision 
required  and  the  smoothness  of  K . 

Examples  of  convolution  kernels  K  with  rapidly  converging  Hermite  series  include  any 
smooth  function  which  decays  at  infinity  faster  than  any  power;  in  particular,  any  smooth 
function  with  compact  support. 

7  Conclusions 

We  have  presented  a  “fast  Gauss  transform”  algorithm  for  evaluating  the  sums 

G(U)  =  'tqJe-\ti~^ls  z  =  1, . . . ,  M  (27) 

j= i 

for  1  <  i  <  M  in  0(M  +  N)  work.  Direct  evaluation  would  require  O(NM)  work  in  general, 

so  this  is  a  substantial  improvement  in  computational  complexity.  In  order  to  evaluate  the 

sum  of  100,000  Gaussians  at  100,000  points,  for  example,  the  fast  algorithm  requires  about  * 

a  minute  and  a  half  of  CPU  time  on  a  Sun-4,  while  direct  evaluation  would  take  more  than 

5  days.  There  are  many  fields  of  applied  mathematics  where  such  an  algorithm  will  be  a 

useful  tool. 
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